ABSTRACT

Estimating characteristics of large graphs via sampling is
a vital part of the study of complex networks. Current 
sampling methods such as (independent) random vertex and 
random walks are useful but have drawbacks. Random ver- 
tex sampling may require too many resources (time, band- 
width, or money). Random walks, which normally require 
fewer resources per sample, can suffer from large estimation 
errors in the presence of disconnected or loosely connected 
graphs. In this work we propose a new m-dimensional ran- 
dom walk that uses m dependent random walkers. We show 
that the proposed sampling method, which we call Fron- 
tier sampling, exhibits all of the nice sampling properties 
of a regular random walk. At the same time, our simula- 
tions over large real world graphs show that, in the presence 
of disconnected or loosely connected components, Frontier
sampling exhibits lower estimation errors than regular ran- 
dom walks. We also show that Frontier sampling is more 
suitable than random vertex sampling to sample the tail of 
the degree distribution of the graph. 
1. INTRODUCTION

A number of recent studies
are dedicated to the characterization of
complex networks. A complex network is a network with 
non-trivial topological features (features that do not occur 
in simple networks such as lattices or random networks). 
Examples of such networks include the Internet, the World 
Wide Web, social, business, and biological networks [7, 28].
This work represents a complex network as a directed graph 
with labeled vertices and edges. A label can be, for in- 
stance, the degree of a vertex or, in a social network setting, 
someone's hometown. Examples of network characteristics 
include the degree distribution, the fraction of HIV positive 
individuals in a population [21], or the average number of 
copies of a file in a peer-to-peer (P2P) network [16].

Characterizing the labels of a graph requires querying ver- 
tices and/or edges; each query has an associated cost in re- 
sources (time, bandwidth, money). Characterizing a large 
graph by querying the whole graph is often too costly. As a 
result, researchers have turned their attention to the estima- 
tion of graph characteristics based on incomplete (sampled) 
data. In this work we present a new tool, Frontier Sampling,
to characterize complex networks. In what follows random
vertex (edge) sampling refers to sampling vertices (edges)
independently and uniformly at random (with replacement).

Distinct sampling strategies have different resource re- 
quirements depending on the network being sampled. For 
instance, in a network where each vertex is assigned a unique 
user-id (e.g., travelers and their passport numbers, Facebook,
MySpace, Flickr, and Livejournal) it is a widespread
practice to perform random vertex sampling by querying 
randomly generated user-ids. This approach can be resource- 
intensive if the user-id space is sparsely populated as the 
hit-to-miss ratio is low (e.g., less than 10% of all MySpace 
user-ids between the highest and lowest valid user-ids are 
currently occupied [30]). Another way to sample a network
is by querying edges instead of vertices. Randomly sam- 
pling edges can be harder than randomly sampling vertices 
if edges are not associated with unique IDs (or if edge IDs
cannot be randomly queried). We summarize some draw- 
backs of random vertex and edge sampling: 

• Random edge sampling may be impractical when edges 
cannot be randomly queried (e.g., online social networks
like Facebook [13], MySpace [30], and Twitter
or a P2P network like Bittorrent). 

• Random vertex sampling may be undesirable when 
user-ids are sparsely populated (low hit-to-miss ratio) 
and queries are subject to resource constraints (e.g., 
queries are rate-limited in Flickr, Livejournal [26], and
Bittorrent [18]). In a P2P network like Bittorrent, a
client can randomly sample peers (vertices) by querying
a tracker (server); however, trackers may rate-limit
client queries [18].

• Even when random vertex sampling is not severely 
resource-constrained, some characteristics may be bet- 
ter estimated with random edge sampling (e.g., the tail 
of the degree distribution of a graph). 

An alternative, and often cheaper, way to sample a network 
is by means of a random walk (RW). A RW samples a graph
by moving a particle (walker) from a vertex to a neighboring
vertex (over an edge). By this process edges and vertices are
sampled. The probability by which the random walker se- 
lects the next neighboring vertex determines the probability 
by which vertices and edges are sampled. In this work we 
are interested in random walks that sample edges uniformly. 
The edges sampled by RW can then be used to obtain un- 
biased estimates of a variety of graph characteristics (we 
present two examples in Section 4).



In this work we assume that a random walker has the
ability to query a vertex to obtain all of its incoming and
outgoing edges (Section 2 details the reason behind this
assumption). This is possible for online networks such as
Twitter, Livejournal [26], YouTube, Facebook [15], MySpace [30],
P2P networks [29], and the arXiv citations network.
We revisit the theory behind random walks in Section 4.

Sampling graphs with random walks is not without draw- 
backs. The accuracy of the estimates depends not only on 
the graph structure but also on the characteristic being es- 
timated. The graph structure can create distortions in the 
estimates by "trapping" the random walker inside a sub- 
graph. An extreme case happens when the graph consists 
of two or more disconnected components (subgraphs). For 
instance, wireless mobile social networks exhibit connection 
graphs with multiple disconnected components [11]. But
even connected graphs can suffer from the same problem. 
A random walker can get "temporarily trapped" and spend 
most of its sampling budget exploring the local neighbor- 
hood near where it got "trapped". In the above scenarios 
estimates may be inaccurate if the characteristics of the lo- 
cal neighborhood differ from the overall characteristic of the 
graph. This problem is well documented and our
goal is to mitigate it. 



Contributions 

This work proposes a new m-dimensional random walk sam- 
pling method (Frontier sampling) that, starting from a col- 
lection of m randomly sampled vertices, preserves all of the 
important statistical properties of a regular random walk 
(e.g., vertices are visited with a probability proportional to 
their degree). While the vertices are visited with a proba- 
bility proportional to their degree, we show that the joint 
steady state distribution of Frontier Sampling (the joint dis- 
tribution of all m vertices) is closer to uniform (the starting 
distribution) than that of m independent random walkers, 
for any m > 0. This property has the potential to dramati- 
cally reduce the transient of random walks. 

In our simulations using real world graphs we see that 
Frontier Sampling mitigates the large estimation errors caused 
by disconnected or loosely connected components that can 
"trap" a random walker and distort the estimated graph 
characteristic, i.e., Frontier Sampling (FS) estimates have
smaller Mean Squared Errors (MSEs) than estimates obtained
from regular random walkers (single and multiple independent
walkers, reviewed in Section 4.4) in a variety of
scenarios. 

We make two additional contributions: (1) we compare 
random walk-based estimates to those obtained from ran- 
dom vertex and random edge sampling. We show analyt- 
ically that the tail of the degree distribution is better es- 
timated using random edge sampling than random vertex 
sampling. We observe from simulations over real world networks
(in Section 6.4) that FS accuracy is comparable to
the accuracy of random edge sampling. These results help
explain recent empirical results [55]; (2) we present asymptotically
unbiased estimators using the edges sampled by a
RW for the assortative mixing coefficient (defined in
Section 4.2.2) and the global clustering coefficient (defined in
Section 4.2.4).



Outline 

The outline of this work is as follows. Section 2 presents
the notation used in this paper. Section 3 contrasts random
vertex with random edge sampling. Section 4 revisits
single and multiple independent random walk sampling and
estimation. Section 5 introduces Frontier Sampling (FS), a
sampling process that uses m dependent random walkers in
order to mitigate the high estimation errors caused by
disconnected or loosely connected components. Section 5 also
shows that FS can be seen as an m-dimensional random
walk over the m-th Cartesian power of the graph (formally
defined in Section 5). In Section 6 we see that FS outperforms
both single and multiple independent random walkers
in a variety of scenarios. We also compare (independent)
random vertex and edge sampling with FS. Section 7 reviews
the relevant literature. Finally, Section 8 presents our
conclusions and future work.

2. DEFINITIONS 

In what follows we present some definitions. Let $G_d = (V, E_d)$
be a labeled directed graph representing the (original)
network graph, where $V$ is a set of vertices and $E_d$ is a
set of ordered pairs of vertices $(u, v)$ representing a connection
from $u$ to $v$ (a.k.a. edges). We assume that each vertex
in $G_d$ has at least one incoming or outgoing edge. The
in-degree of a vertex $u$ in $G_d$ is the number of distinct edges
$(v_1, u), \ldots, (v_i, u)$ into $u$, and its out-degree is the number
of distinct edges $(u, v_1), \ldots, (u, v_j)$ out of $u$. Some complex
networks can be modeled as undirected graphs. In this
case, when the original graph is undirected, we model $G_d$ as
a symmetric directed graph, i.e., $\forall (u, v) \in E_d$, $(v, u) \in E_d$.
Let $\mathcal{L}_v$ and $\mathcal{L}_e$ be finite sets of vertex and edge labels,
respectively. Each edge $(u, v) \in E_d$ is associated with a set
of labels $\mathcal{L}_e(u, v) \subseteq \mathcal{L}_e$. For instance, the label of edge $(u, v)$
can be the in-degree of $v$ in $G_d$. Similarly, we can associate a
set of labels to each vertex, $\mathcal{L}_v(v) \subseteq \mathcal{L}_v$, $\forall v \in V$. Some edges
and vertices may not have labels. If edge $(u, v)$ is unlabeled
then $\mathcal{L}_e(u, v) = \emptyset$. Similarly, if vertex $v$ is unlabeled then
$\mathcal{L}_v(v) = \emptyset$.

When performing a random walk, we assume that a random
walker has the ability to retrieve incoming and outgoing
edges from a queried vertex (and vertices are distinguishable).
With this assumption we are able to build (on-the-fly)
a symmetric directed graph while walking over $G_d$. Let
$G = (V, E)$ be the symmetric counterpart of $G_d$, i.e.,

$$E = \bigcup_{\forall (u,v) \in E_d} \{(u, v), (v, u)\}.$$

Note that $G$ may not be connected. As $G$ is symmetric, we
denote by $\deg(v)$ the in-degree or the out-degree of
$v \in V$, as they are equal. Let $\mathrm{vol}(S) = \sum_{\forall v \in S} \deg(v)$, $\forall S \subseteq V$,
denote the volume of the vertices in $S$.

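Let $\hat{\theta}_l$ be the estimate, obtained by some estimator, of the
fraction $\theta_l$ of vertices with label $l$. The two error metrics used in
most of our examples are the normalized root mean square
error of $\hat{\theta}_l$, which is a normalized measure of the dispersion
of the estimates, defined as

$$\mathrm{NMSE}(l) = \frac{\sqrt{E[(\hat{\theta}_l - \theta_l)^2]}}{\theta_l}, \tag{1}$$

and the normalized root mean square error of the Complementary
Cumulative Distribution Function (CCDF) $\gamma = \{\gamma_l\}$,
where $\gamma_l = \sum_{k=l+1}^{\infty} \theta_k$, defined as

$$\mathrm{CNMSE}(l) = \frac{\sqrt{E[(\hat{\gamma}_l - \gamma_l)^2]}}{\gamma_l}. \tag{2}$$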



For the sake of simplicity, and unless stated otherwise, in 
the remainder of this paper we assume that all queries of 
edges and vertices have unitary cost and that we have a 
fixed sampling budget B. 

3. VERTEX V.S. EDGE SAMPLING 

We consider a straightforward estimation problem to illustrate
a tradeoff between random edge and random vertex
sampling. Consider the problem of estimating the out-degree
distribution of $G_d$. Let $\theta_i$ be the fraction of vertices
with out-degree $i > 0$ and $d$ be the average out-degree. Let
the label of vertex $u$, $\mathcal{L}_v(u)$, be the out-degree of $u$. We assume
that $d$ is known; also assume that from an edge $(u, v)$
we can query $\mathcal{L}_v(u)$. In random edge sampling the probability
of sampling a vertex with out-degree $i$ is proportional
to $i$: $\pi_i = i\theta_i/d$. On the other hand, random vertex sampling
samples a vertex with out-degree $i$ with probability $\theta_i$.
A straightforward calculation shows that the NMSE (equation (1))
of the estimate for out-degree $i$ obtained from $B$ randomly
sampled edges is

$$\mathrm{NMSE}(i) = \sqrt{(1/\pi_i - 1)/B}, \quad i > 0. \tag{3}$$

Similarly, the NMSE($i$) for random vertex sampling is

$$\mathrm{NMSE}(i) = \sqrt{(1/\theta_i - 1)/B}. \tag{4}$$

Now note that $\pi_i/\theta_i = i/d$, which means that $\pi_i > \theta_i$ if
$i > d$ and $\pi_i < \theta_i$ if $i < d$. From equations (3) and (4) we
see that random edge sampling more accurately estimates
degrees larger than the average ($i > d$) while random vertex
sampling more accurately estimates degrees smaller than the
average ($i < d$). This means random edge sampling exhibits
smaller NMSE when estimating the tail of the out-degree
distribution.
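
To make the tradeoff concrete, the following sketch evaluates equations (3) and (4) on a small out-degree distribution. The distribution `theta`, the budget `B`, and the function names are hypothetical values chosen only for illustration; they do not come from any dataset used in this paper.

```python
import math

def nmse_edge(theta_i, i, d, B):
    """Equation (3): NMSE of random edge sampling for out-degree i."""
    pi_i = i * theta_i / d  # probability that a sampled edge starts at a degree-i vertex
    return math.sqrt((1.0 / pi_i - 1.0) / B)

def nmse_vertex(theta_i, B):
    """Equation (4): NMSE of random vertex sampling for out-degree i."""
    return math.sqrt((1.0 / theta_i - 1.0) / B)

# Hypothetical out-degree distribution: theta[i] is the fraction of vertices with out-degree i.
theta = {1: 0.6, 2: 0.3, 10: 0.1}
d = sum(i * t for i, t in theta.items())  # average out-degree
B = 1000                                   # sampling budget (assumed)

for i, t in sorted(theta.items()):
    side = "above" if i > d else "below"
    print(f"degree {i:2d} ({side} average): "
          f"edge NMSE = {nmse_edge(t, i, d, B):.4f}, "
          f"vertex NMSE = {nmse_vertex(t, B):.4f}")
```

For degrees above the average the edge-sampling NMSE is the smaller of the two, and for degrees below the average the ordering is reversed, as the analysis predicts.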

Above we have seen that random edge sampling is more
accurate than random vertex sampling in estimating the tail
of the out-degree distribution. A similar result holds
for the in-degree distribution and the degree distribution
of undirected networks. The above analysis explains
results observed in real world experiments. Unfortunately, as
discussed in Section 1, random edge sampling is rarely practical. In what
follows we see that, if $G$ is connected, random walks exhibit
similar statistical properties to random edge sampling.

4. RANDOM WALK SAMPLING 

In this section we review random walk (RW) sampling and
estimation over a non-bipartite, connected, directed, symmetric
graph $G$. Sampling $G$ with a RW is straightforward.
The random walker has a sampling budget $B$ and starts at
vertex $v_0 \in V$. For the sake of simplicity, unless stated otherwise,
we consider that all queries to vertices have unit cost
and that we have a fixed sampling budget $B$.

Let $\{(u_i, v_i)\}_{i=1}^{B}$ be the sequence of edges sampled by a
RW, where $u_i = v_{i-1}$, $i = 2, \ldots, B$. Note that edges may
be sampled multiple times. We refer to $(u_i, v_i)$ as the $i$-th
sampled edge. At the $i$-th step a walker at vertex $u_i$ chooses
an outgoing edge $(u_i, v_i)$ uniformly at random from the set
of outgoing edges of $u_i$ and adds $(u_i, v_i)$ to the sequence of
sampled edges. At step $i + 1$ the random walker starts at
vertex $v_i$ and the sampling continues until $i = B$.
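
The sampling process just described can be sketched in a few lines. This is a minimal illustration, assuming the symmetric graph is available as an adjacency dictionary `adj` (a hypothetical representation; in practice neighbors would be obtained by querying the network one vertex at a time).

```python
import random

def random_walk_edges(adj, start, budget, rng=random):
    """Return the sequence of edges (u_i, v_i) sampled by a simple random walk.

    adj    : dict mapping each vertex to the list of its neighbors (symmetric graph)
    start  : initial vertex v_0
    budget : number of steps B (one unit of budget per step)
    """
    edges = []
    u = start
    for _ in range(budget):
        v = rng.choice(adj[u])   # pick an outgoing edge of u uniformly at random
        edges.append((u, v))
        u = v                    # the next step starts where this edge ended
    return edges

# Hypothetical toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
edges = random_walk_edges(adj, start=0, budget=10, rng=random.Random(1))
print(edges)
```

Consecutive edges share an endpoint, which is the dependence between samples that the estimators in the following sections must tolerate.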

The RW described here is the most common type of RW
found in the literature [25]. Other types of random walks
differ in the way in which outgoing edges are sampled. The
Metropolis-Hastings RW [37] is an example of a random
walk that samples vertices (not edges) uniformly at random.
However, experiments estimating a variety of metrics indicate
that Metropolis-Hastings RW is less accurate than the
random walk described in this work [15, 29]. For more details
about other types of RW please refer to [31, Chapter 7].

An important property of a RW is its ability to reach a
unique stationary regime. A necessary condition for stationarity
is that $G$ must be symmetric, connected, and non-bipartite
(the non-bipartite assumption can be relaxed in
a lazy random walk [22]). In a stationary RW, the sequence
of sampled edges is a stationary sequence. A sequence
$X_1, X_2, \ldots$ of random variables is said to be stationary
if for any positive integers $n$ and $k$, the joint distribution
of $(X_n, \ldots, X_{n+k})$ is independent of $n$. Once the RW
reaches steady state, it also shares two important properties
with random edge (RE) sampling. First, both RW and RE
sample edges uniformly at random [25], which means that
the probability that a vertex $v$ is sampled is $\deg(v)/\mathrm{vol}(V)$.
Second, both RW and RE obey the strong law of large numbers,
as we see next.

4.1 Strong Law of Large Numbers 

The following variation of the strong law of large numbers
is a powerful tool to build (asymptotically) unbiased
estimators of graph characteristics. We provide a trivial
extension of a well known result [25, Theorem 17.2.1] to
the case where we are interested in a subset of the graph
edges. Let $E^* \subseteq E$ be non-empty. Let $(u_i, v_i)$ be the $i$-th
RW sampled edge such that $(u_i, v_i) \in E^*$; and let $B^*(B)$
be the number of such samples, where $B$ is the number of
RW steps. $B^*(B)$ is a random variable that represents the
number of RW sampled edges that belong to $E^*$. Note that
$B^*(B) \le B$.

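Theorem 4.1 (SLLN). For any function $f$, where
$\sum_{\forall (u,v) \in E^*} |f(u, v)| < \infty$,

$$\lim_{B \to \infty} \frac{1}{B^*(B)} \sum_{i=1}^{B^*(B)} f(u_i, v_i) = \frac{1}{|E^*|} \sum_{\forall (u,v) \in E^*} f(u, v)$$

almost surely, i.e., the event occurs with probability one.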
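Proof. Let

$$h(u, v) = \begin{cases} 1 & \text{if } (u, v) \in E^*, \text{ and} \\ 0 & \text{otherwise.} \end{cases}$$

As the RW is stationary (and edges are sampled uniformly)

$$\lim_{B \to \infty} \frac{1}{B} \sum_{i=1}^{B} f(u'_i, v'_i)\, h(u'_i, v'_i) = \frac{1}{|E|} \sum_{\forall (u,v) \in E} f(u, v)\, h(u, v)$$

almost surely [25, Theorem 17.2.1], where $(u'_i, v'_i)$ denotes the
$i$-th edge sampled by the RW, whether or not it belongs to $E^*$.
The proof follows from noting that
$B^*(B) = \sum_{i=1}^{B} h(u'_i, v'_i)$ and that $h(u, v) = 0$,
$\forall (u, v) \in E \setminus E^*$. □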



Theorem 4.1 allows us to construct estimators of graph
characteristics that converge to their true values as the number
of RW samples goes to infinity ($B \to \infty$). If we are trying
to estimate vertex labels we set $E^* = E$ and $B^* = B$.
In what follows we apply Theorem 4.1 to estimate graph
characteristics; we also present four examples of estimators.

4.2 Estimators 

An estimator is a function that takes a sequence of obser- 
vations (sampled data) as input and outputs an estimate of 
an unknown population parameter (graph characteristic). In
this section we see how we can estimate graph characteristics 
using the edges sampled by a RW. 

We present estimators of the following four graph charac- 
teristics: the edge label density (the fraction of edges with 
a given label in the graph), the assortative mixing coefficient [27],
the vertex label density, and the global clustering
coefficient [33]. Designing these estimators is straightforward:

(1) First we find a function $f$ that computes the characteristic
of $G$ using $E$;

(2) then we replace E with the sequence of edges sampled 
by a stationary RW. 

In what follows we illustrate how to build an estimator of 
the edge label density. 

4.2.1 Edge Label Density

We seek to estimate the fraction of edges with label $l \in \mathcal{L}_e$
in $G_d$ among all edges $(u, v)$ that have labels, i.e., $\mathcal{L}_e(u, v) \neq \emptyset$.
Edge labels can be anything, from social networking labels
to the amount of IP traffic over each link in a computer
network. An edge label can be, for instance, a tuple
$(\mathrm{outdeg}(u), \mathrm{indeg}(v))$ where $\mathrm{outdeg}(u)$ is the out-degree of $u$
and $\mathrm{indeg}(v)$ is the in-degree of $v$ in the original graph $G_d$.

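For now we assume that we know $E$. Let $E^*$ be the non-empty
subset of $E$ for which there are labels. Let $p_l$ denote
the fraction of edges in $E^*$ with label $l$; it is clear that

$$p_l = \frac{1}{|E^*|} \sum_{\forall (u,v) \in E^*} \mathbb{1}(l \in \mathcal{L}_e(u, v)),$$

where

$$\mathbb{1}(l \in \mathcal{L}_e(u, v)) = \begin{cases} 1 & \text{if } l \in \mathcal{L}_e(u, v) \\ 0 & \text{otherwise.} \end{cases}$$

Let $B^*(B)$ be the number of RW sampled edges that belong
to $E^*$ and $(u_i, v_i)$ be the $i$-th of such edges. Replacing $E^*$
with the edges in $E^*$ sampled by a stationary RW gives the
following estimator

$$\hat{p}_l = \frac{1}{B^*(B)} \sum_{i=1}^{B^*(B)} \mathbb{1}(l \in \mathcal{L}_e(u_i, v_i)). \tag{5}$$

It follows directly from Theorem 4.1 (with $f(u, v) = \mathbb{1}(l \in \mathcal{L}_e(u, v))$)
that $\lim_{B \to \infty} \hat{p}_l = p_l$ almost surely. Moreover, from the linearity
of expectation, $E[\hat{p}_l] = p_l$ for all values of $B^*(B) > 0$.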
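
Equation (5) translates directly into code. The sketch below assumes the sampled edges are given as a list of pairs and that labels are stored in a dictionary `edge_labels` mapping an edge to its set of labels (both are hypothetical representations chosen for illustration); unlabeled edges are simply absent from the dictionary or mapped to an empty set.

```python
def edge_label_density(sampled_edges, edge_labels, label):
    """Estimate p_l as in equation (5).

    sampled_edges : list of (u, v) pairs produced by a stationary random walk
    edge_labels   : dict mapping (u, v) to a set of labels; missing or empty means unlabeled
    label         : the label l whose density is estimated
    """
    # Keep only the samples that fall in E*, the set of labeled edges.
    labeled = [e for e in sampled_edges if edge_labels.get(e)]
    if not labeled:
        raise ValueError("no labeled edge was sampled, so B*(B) = 0")
    hits = sum(1 for e in labeled if label in edge_labels[e])
    return hits / len(labeled)

# Hypothetical sample: four sampled edges, of which three are labeled.
sample = [(0, 1), (1, 2), (2, 0), (0, 1)]
labels = {(0, 1): {"a"}, (1, 2): {"b"}}
print(edge_label_density(sample, labels, "a"))
```

Here $B^*(B) = 3$ because the edge (2, 0) carries no label, and two of the three labeled samples have label "a".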

4.2.2 Assortative Mixing Coefficient 

The assortative mixing coefficient [27] is a measure of the
correlation of labels between two neighboring vertices. By
appropriately assigning edge labels derived from vertex labels,
we can use the density estimator of equation (5) to
derive an estimator of the assortative mixing coefficient. In
order to simplify our exposition, we restrict our analysis to
the assortative mixing of vertex degrees in a directed graph
(equation (25) of [27]). It is trivial to extend our analysis
to other types of assortative mixing coefficients, e.g.,
equations (21) and (23) of [27].

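Let $(\mathrm{outdeg}(u), \mathrm{indeg}(v))$ denote the label of a directed
edge $(u, v)$ in $G$ that also exists in $G_d$, and let $E^*$ be the set
of all such edges ($E^* = E_d$). Let $p_{ij}$ denote the fraction of
labeled edges with label $(i, j)$. Let $W_{\mathrm{out}}$ ($W_{\mathrm{in}}$) denote the
maximum observed out-degree (in-degree) of $G_d$ in the RW.
The degree assortative mixing coefficient [27] of a directed
graph can be estimated using

$$\hat{r} = \frac{1}{\hat{\sigma}_{\mathrm{in}} \hat{\sigma}_{\mathrm{out}}} \sum_{i=0}^{W_{\mathrm{out}}} \sum_{j=0}^{W_{\mathrm{in}}} i j \left( \hat{p}_{ij} - \hat{q}^{\mathrm{out}}_i \hat{q}^{\mathrm{in}}_j \right),$$

where

$$\hat{p}_{ij} = \frac{1}{B^*(B)} \sum_{k=1}^{B^*(B)} \mathbb{1}(\mathrm{outdeg}(u_k) = i,\ \mathrm{indeg}(v_k) = j),$$

$$\hat{q}^{\mathrm{out}}_i = \sum_{k=0}^{W_{\mathrm{in}}} \hat{p}_{ik}, \qquad \hat{q}^{\mathrm{in}}_j = \sum_{k=0}^{W_{\mathrm{out}}} \hat{p}_{kj},$$

and $\hat{\sigma}_{\mathrm{in}}$ and $\hat{\sigma}_{\mathrm{out}}$ are the standard deviations of the
distributions $\hat{q}^{\mathrm{in}}$ and $\hat{q}^{\mathrm{out}}$, respectively, e.g.,

$$\hat{\sigma}_{\mathrm{in}} = \sqrt{ \sum_{j=0}^{W_{\mathrm{in}}} j^2 \hat{q}^{\mathrm{in}}_j - \Big( \sum_{j=0}^{W_{\mathrm{in}}} j\, \hat{q}^{\mathrm{in}}_j \Big)^2 }.$$

As the estimate $\hat{p}_{ij}$ (equation (5)) asymptotically
converges almost surely to its true value, it is trivial
to show that $\hat{q}^{\mathrm{in}}_j$, $\hat{q}^{\mathrm{out}}_i$, $\hat{\sigma}_{\mathrm{in}}$, and $\hat{\sigma}_{\mathrm{out}}$ also asymptotically
converge almost surely to their true values. Thus, $\hat{r}$ asymptotically
converges, almost surely, to the true assortative mixing
coefficient of [27], as long as $\sigma_{\mathrm{in}} > 0$ and $\sigma_{\mathrm{out}} > 0$. This
implies that $\hat{r}$ is an asymptotically unbiased estimator of the
assortative mixing coefficient of $G_d$.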

4.2.3 Vertex Label Density 

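Let $\mathcal{L}_v(v)$ be the set of labels associated with vertex $v$, $\forall v \in V$.
The fraction of vertices with label $l$ in $G$, $\theta_l$, is

$$\theta_l = \frac{1}{|V|} \sum_{\forall (u,v) \in E} \frac{\mathbb{1}(l \in \mathcal{L}_v(v))}{\deg(v)}, \tag{6}$$

as $G = (V, E)$ is directed and symmetric. By replacing $E$
with a sequence of edges sampled by a stationary RW (here
we have $E^* = E$ and $B^* = B$) and renormalizing, we arrive
at the following estimator for $\theta_l$

$$\hat{\theta}_l = \frac{1}{S B} \sum_{i=1}^{B} \frac{\mathbb{1}(l \in \mathcal{L}_v(v_i))}{\deg(v_i)}, \tag{7}$$

where $S = \frac{1}{B} \sum_{i=1}^{B} 1/\deg(v_i)$. From Theorem 4.1 we have
$\lim_{B \to \infty} S = |V|/|E|$ almost surely. Using again Theorem 4.1
we have

$$\lim_{B \to \infty} \frac{1}{B} \sum_{i=1}^{B} \frac{\mathbb{1}(l \in \mathcal{L}_v(v_i))}{\deg(v_i)} = \frac{1}{|E|} \sum_{\forall (u,v) \in E} \frac{\mathbb{1}(l \in \mathcal{L}_v(v))}{\deg(v)}$$

almost surely, which divided by $|V|/|E|$ yields equation (6).
As $S$ converges almost surely to $|V|/|E|$, we have $\lim_{B \to \infty} \hat{\theta}_l = \theta_l$,
almost surely. This also implies that $\hat{\theta}_l$ is an asymptotically
unbiased estimator of $\theta_l$.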
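
The reweighting in equation (7) can be sketched as follows. The code assumes the sampled edges are a list of pairs, that `deg` maps each vertex to its degree in $G$, and that `vertex_labels` maps each vertex to its set of labels; these names and the toy graph are hypothetical and serve only to illustrate the estimator.

```python
def vertex_label_density(sampled_edges, deg, vertex_labels, label):
    """Estimate theta_l as in equation (7) from edges sampled by a stationary RW."""
    B = len(sampled_edges)
    # S = (1/B) * sum of 1/deg(v_i); it converges to |V|/|E|.
    S = sum(1.0 / deg[v] for _, v in sampled_edges) / B
    # Each sample is weighted by 1/deg(v_i) to undo the degree bias of the walk.
    weighted_hits = sum(1.0 / deg[v] for _, v in sampled_edges
                        if label in vertex_labels.get(v, ())) / B
    return weighted_hits / S

# Hypothetical toy graph: triangle 0-1-2 plus pendant vertex 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
deg = {v: len(nb) for v, nb in adj.items()}
vertex_labels = {2: {"x"}, 3: {"x"}}

# Sanity check: if every directed edge is "sampled" exactly once, the estimate is exact.
all_edges = [(u, v) for u in adj for v in adj[u]]
print(vertex_label_density(all_edges, deg, vertex_labels, "x"))
```

Feeding the estimator every directed edge exactly once mimics the uniform edge distribution of a stationary walk, so it returns the true fraction (two of the four vertices carry label "x").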

4.2.4 Global Clustering Coefficient 

In the literature the term clustering coefficient often refers
to the local clustering coefficient [36]. In our example we
estimate a different metric: the global clustering coefficient.
In a social network the global clustering coefficient, $C$, is
the probability that the friend of John's friend is also John's
friend [33]. Let $V^*$ be the set of vertices $v \in V$ with $\deg(v) > 1$.
The global clustering coefficient of an undirected graph
is defined as [33]

$$C = \frac{1}{|V^*|} \sum_{\forall v \in V^*} c(v), \tag{8}$$

where

$$c(v) = \begin{cases} \Delta(v) \Big/ \binom{\deg(v)}{2} & \text{if } \deg(v) > 1 \\ 0 & \text{otherwise,} \end{cases}$$

where $\Delta(v) = |\{(u, w) \in E : (v, u) \in E \text{ and } (v, w) \in E\}|$ is
the number of triangles that contain vertex $v$ and $\binom{\deg(v)}{2}$
is the maximum number of triangles that a vertex $v$ with
degree $\deg(v)$ can belong to.

Note that finding $\Delta(v)$ for a given vertex $v \in V$ requires
knowing all vertices within two hops of $v$, which can be a
resource intensive task. To avoid the cost of computing $\Delta$,
we rewrite equation (8)

$$C = \frac{1}{|V^*|} \sum_{\forall (v,u) \in E} \frac{f(v, u)}{\binom{\deg(v)}{2}},$$

where $f(v, u)$ gives the number of shared neighbors between
$u$ and $v$ (as in equation (8), terms with $\deg(v) \le 1$ are taken to be zero).

Let $(v_i, u_i)$ be the $i$-th sampled edge in a stationary RW
and let

$$\hat{C} = \frac{1}{S B} \sum_{i=1}^{B} \frac{f(v_i, u_i)}{\binom{\deg(v_i)}{2}},$$

where

$$S = \frac{1}{B} \sum_{i=1}^{B} \frac{\mathbb{1}(\deg(v_i) > 1)}{\deg(v_i)}.$$

Corollary 4.2. $\lim_{B \to \infty} \hat{C} = C$, almost surely.

Proof. From Theorem 4.1

$$\lim_{B \to \infty} S = |V^*|/|E|,$$

almost surely. Also from Theorem 4.1

$$\lim_{B \to \infty} \frac{1}{B} \sum_{i=1}^{B} \frac{f(v_i, u_i)}{\binom{\deg(v_i)}{2}} = \frac{1}{|E|} \sum_{\forall (v,u) \in E} \frac{f(v, u)}{\binom{\deg(v)}{2}}$$

almost surely, which together with the almost sure convergence
of $S$ implies that $\lim_{B \to \infty} \hat{C} = C$, almost surely. □

Note that almost sure convergence implies that $\hat{C}$ is an
asymptotically unbiased estimator of $C$.



4.3 Estimator Accuracy & Graph Structure 

Sampling a graph using a RW is not without drawbacks. 
A random walker can get (temporarily) "trapped" inside a 
subgraph whose characteristics differ from those of the whole 
graph. Even if the random walker starts in steady state (i.e., 
is stationary), this scenario may increase the mean squared 
error of the estimates. If the random walker does not start 
in steady state, this scenario may cause an increase in the 
estimation bias as well as the mean squared error. Ideally, 
the random walker needs to mitigate the effect of these traps 
on the estimates. 

The above two types of estimation errors are well doc- 
umented in the literature and various solutions are avail- 
able [13]. For instance, if the random walker does not start 
in a stationary regime (transient), it is common practice to 
discard the first w samples [14]. The value of w is called the
burn-in period. There are two problems with this solution: 
(1) it only reduces the error related to the non-stationarity 
of the samples; (2) it is difficult to determine a good value 
for w if the sampling budget is small (compared to the size 
of the graph) and the size and structure of G are unknown. 

A simple naive solution to the RW "trapping" problem
(adopted in [15] to sample Facebook) is to sample the graph
using multiple independent random walkers [14]. In what
follows we see that this naive approach can lead to increased
estimation errors. In Section 5 we propose a method to
mitigate the random walk "trapping" problem using m
dependent random walkers.

4.4 Multiple Independent Random Walkers 

The main problem of estimating graph characteristics us- 
ing a single walker is that the walker may get trapped inside 
a local neighborhood. But there is the question of what 
happens if we could start m independent random walkers 
(MultipleRW) at m independently sampled vertices in the
graph. Note that when m = 1 we are back to sampling G 
using a single random walker, which we denote as SingleRW. 
Networks such as MySpace, Facebook, and Bittorrent admit 
random (uniform) vertex sampling at cost c higher than the 
cost of sampling the neighbors of a known vertex (which is 
what a RW does). In such networks random vertex sampling
may help us start m random walkers at different parts of the 
graph. While the value of c can be large, initializing m ran- 
dom walkers with uniformly sampled vertices costs (only) 
mc units of our sampling budget (where one unit of the
budget is the cost of sampling a vertex in a RW). 
Figure 1: (Flickr) The log-log plot of the CNMSE of the in-degree
distribution estimates with budget $B = |V|/10$.

Unfortunately, m independent random walkers starting at
m randomly sampled vertices may decrease estimation accuracy.
Consider the following experiment where each of
the m random walkers (independently) performs $\lfloor B/m - c \rfloor$
steps. We seek to estimate the CCDF (complementary cumulative
distribution function) of the in-degree of the Flickr
graph (the Flickr dataset is summarized in Table 1). According
to Table 1 the Flickr graph is disconnected. The goal
of this simulation is to compare the estimation accuracy of
SingleRW and MultipleRW when there are no disconnected
components. For this we set $c = 1$. The sampling budget
is $B = 171{,}525 = |V|/10$, which amounts to a sampling
budget equivalent to 10% of the vertices in the graph. Figure 1
shows a log-log plot of the CNMSE, equation (2), of
SingleRW and MultipleRW ($m = 10$) averaged over 10,000
runs. Note that the estimates obtained by SingleRW are,
on average, more accurate than the estimates obtained by
MultipleRW. Increasing the sampling budget $B$ does not reduce
the gap. In Section 6 we see, over other real-world
graphs, that when starting random walkers from uniformly
sampled vertices, MultipleRW has higher estimation errors
than SingleRW.

4.5 Disconnected Graph Example 

The following example shows a situation in which both
MultipleRW and SingleRW have large estimation errors. In
this example we initialize MultipleRW with m randomly
(uniformly) sampled vertices. We simplify our exposition by
assuming that each MultipleRW walker takes $B/m$ steps,
where $B$ (the sampling budget) is a multiple of $m$. Let
$G = (V, E)$ be an undirected graph that has two large disconnected
components $G_A = (V_A, E_A)$ and $G_B = (V_B, E_B)$.
Let $|V_A| = |V_B|$ and $\mathrm{vol}(V_A) > \mathrm{vol}(V_B)$. When initial
vertices are uniformly sampled, the probability that each
MultipleRW walker (independently) starts in $G_A$ ($G_B$) is
$h_A = |V_A|/|V|$ ($h_B = |V_B|/|V|$). Recall that $G_A$ and $G_B$ are
disconnected. For each random walker, after $B/m$ ($B \gg 1$)
RW steps, an edge $(u_a, v_a) \in E_A$ is sampled with probability
$p_A \approx h_A/\mathrm{vol}(V_A)$. Similarly an edge $(u_b, v_b) \in E_B$ is
sampled with probability $p_B \approx h_B/\mathrm{vol}(V_B)$. Thus $p_A < p_B$,
i.e., the edges in $G_B$ are sampled with higher probability
than the edges in $G_A$. As our estimators assume that all
edges are sampled with the same probability, this imbalance
between $p_A$ and $p_B$ has the potential to introduce large
MSEs (and biases). Note that increasing $m$ does not change
$p_A$ and $p_B$. Increasing $B$ only mitigates this problem if $G$
is connected and, in a loosely connected graph, only large
values of $B$ positively impact the MSE. Ideally we want a
RW algorithm that does not rely on large sampling budgets
$B$ to achieve low estimation errors.

Now consider the same thought experiment where each
random walker starts in $G_A$ and $G_B$ (independently) with
probabilities $h_A = \mathrm{vol}(V_A)/\mathrm{vol}(V)$ and $h_B = \mathrm{vol}(V_B)/\mathrm{vol}(V)$,
respectively. In this new scenario it is easy to see that
$p_A = p_B = 1/\mathrm{vol}(V)$. Thus, we would like
to start a RW at vertex $v$ with probability $\deg(v)/\mathrm{vol}(V)$,
$\forall v \in V$. Section 5 shows that in practice this approach
can successfully mitigate estimation errors caused by disconnected
components. Unfortunately, it is difficult to sample
m mutually independent vertices with probabilities proportional
to their degrees. In the case where $G$ is connected, this
is equivalent to jointly starting m independent random walkers
in steady state. In networks such as MySpace, Facebook,
and Bittorrent it is unclear how one can (efficiently) sample
vertices with probabilities proportional to their degrees.
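
The two thought experiments above can be checked with a few lines of arithmetic. The component sizes and volumes below are hypothetical numbers picked for illustration, and `per_edge_prob` is an illustrative helper that applies the approximation $p \approx h/\mathrm{vol}$ used in the text.

```python
def per_edge_prob(start_prob, vol_component):
    """Approximate probability that a long walk samples a given edge of a component,
    when the walker starts inside that component with probability start_prob."""
    return start_prob / vol_component

# Hypothetical components with equal size but different volume (sum of degrees).
n_a = n_b = 1000
vol_a, vol_b = 20000, 4000

# Walkers started at uniformly sampled vertices: h = |V_component| / |V|.
p_a = per_edge_prob(n_a / (n_a + n_b), vol_a)
p_b = per_edge_prob(n_b / (n_a + n_b), vol_b)

# Walkers started with probability proportional to degree: h = vol(component) / vol(V).
q_a = per_edge_prob(vol_a / (vol_a + vol_b), vol_a)
q_b = per_edge_prob(vol_b / (vol_a + vol_b), vol_b)

print("uniform start:             p_A =", p_a, " p_B =", p_b)
print("degree-proportional start: p_A =", q_a, " p_B =", q_b)
```

With uniform starts the edges of the sparser component are sampled five times more often in this example, whereas degree-proportional starts give every edge the same probability $1/\mathrm{vol}(V)$.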

We want an m-dimensional random walk that, in steady 
state, samples edges uniformly at random but, unlike 
MultipleRW, can benefit from starting its walkers at uni- 
formly sampled vertices. 

5. FRONTIER SAMPLING (FS) 

In this section we present a new and promising approach
to an m-dimensional random walk that benefits from starting
its walkers at uniformly sampled vertices. Frontier Sampling
(FS) performs m dependent random walks in the graph.
We refer to m as the dimension of the FS random walk.
Let c be the cost of randomly sampling a vertex. The FS
algorithm, given in Algorithm 1, is a centrally coordinated
sampling algorithm that maintains a list of m vertices
representing m random walkers. This way FS is less likely to
get stuck in loosely connected components than a single random
walker.

Algorithm 1: Frontier Sampling (FS).

1: n ← 0 {n is the number of steps}
2: Initialize L = (v_1, . . . , v_m) with m randomly chosen
   vertices (uniformly)
3: repeat
4:    Select u ∈ L with probability deg(u) / Σ_{v ∈ L} deg(v)
5:    Select an outgoing edge of u, (u, v), uniformly at
      random
6:    Replace u by v in L and add (u, v) to sequence of
      sampled edges
7:    n ← n + 1
8: until n ≥ B − mc

Moreover, in Section 5.2 we see that the joint
steady state distribution of FS is much closer to the uniform
distribution than is the steady state distribution of m independent
random walkers. Section 5.3 describes how the FS
algorithm can be made fully distributed. In Section 6 we
see that, if the initial set of random walk vertices is chosen
uniformly at random, FS estimates are more accurate than
both single and m independent random walkers.
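
As an illustration, the steps of Algorithm 1 can be sketched in Python. This is a minimal, centrally coordinated sketch that assumes the symmetric graph is available as an adjacency dictionary `adj` (a hypothetical in-memory representation; a real crawler would query the network for neighbors) and that the initial vertices are drawn uniformly with replacement.

```python
import random

def frontier_sampling(adj, m, budget, cost=1, rng=random):
    """Sketch of Algorithm 1: return the sequence of edges sampled by FS.

    adj    : dict mapping each vertex to the list of its neighbors (symmetric graph)
    m      : number of dependent walkers (dimension of the walk)
    budget : total sampling budget B
    cost   : cost c of one uniform vertex sample
    """
    vertices = list(adj)
    # Line 2: initialize the frontier with m uniformly sampled vertices.
    frontier = [rng.choice(vertices) for _ in range(m)]
    edges = []
    steps = max(budget - m * cost, 0)  # budget left after paying for the m initial vertices
    for _ in range(steps):
        # Line 4: pick a walker with probability proportional to its vertex degree.
        weights = [len(adj[u]) for u in frontier]
        idx = rng.choices(range(m), weights=weights, k=1)[0]
        u = frontier[idx]
        # Line 5: pick an outgoing edge of u uniformly at random.
        v = rng.choice(adj[u])
        # Line 6: move that walker and record the sampled edge.
        frontier[idx] = v
        edges.append((u, v))
    return edges

# Hypothetical disconnected toy graph: two triangles.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
sampled = frontier_sampling(adj, m=3, budget=50, cost=1, rng=random.Random(7))
print(len(sampled), sampled[:5])
```

The returned edge sequence can be fed to the estimators of Section 4.2 in place of the edges of a single random walk.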

Frontier Sampling: An m-dimensional Random Walk 

FS shares many of the same statistical properties of a single
random walker. The key insight behind Theorem 5.2
below is that the FS stochastic process is equivalent to the
stochastic process of a single random walker over the m-th
Cartesian power of $G$, $G^m = (V^m, E_m)$, where

$$V^m = \{(v_1, \ldots, v_m) \mid v_1 \in V \wedge \cdots \wedge v_m \in V\}$$

is the m-th Cartesian power of $V$ and, $\forall \mathbf{v}, \mathbf{u} \in V^m$, $(\mathbf{v}, \mathbf{u}) \in E_m$
if there exists an index $i$ such that $(v_i, u_i) \in E$ and $u_j = v_j$
for $j \neq i$.
Lemma 5.1. The Frontier sampling process is equivalent 
to the sampling process of a single random walker over G™ . 

Proof. Consider the (n - 1)-st step of FS. The reader may find Figure 2 helpful in following the proof. Let $L_n = (v_1, \dots, v_m)$ be the state of FS before the n-th step. Clearly $L_n \in V^m$. Let $e(L_n)$ denote the collection of all edges associated to the vertices in $L_n$. We refer to $e(L_n)$ as the edge frontier at the n-th step. We describe the transition from state $L_n$ to state $L_{n+1}$ as follows (lines 4 and 5 of the FS algorithm): Select a vertex $v \in L_n$ with probability proportional to $\deg(v)$ and then replace vertex v in $L_n$ with one of its neighbors (selected uniformly at random). This is equivalent to randomly sampling an edge from $e(L_n)$ with probability

$$\frac{1}{|e(L_n)|} = \frac{1}{\sum_{v \in L_n} \deg(v)}.$$

Therefore, $L_n$ transits to state $L_{n+1}$ iff $(L_n, L_{n+1}) \in E_m$, and the transition probability from $L_n$ to $L_{n+1}$ is $1/|e(L_n)|$. Thus, the Markov chain that describes FS is equivalent to the Markov chain of a single random walker over $G^m$. □

Figure 2: Illustration of the Markov chain associated to the Frontier sampler with dimension m = 2.
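The key step of the proof, that the two-stage selection (walker by degree, then neighbor uniformly) gives every frontier edge the same probability, can be checked exactly on a small example. The sketch below uses exact rational arithmetic; the toy graph and the name `fs_edge_probabilities` are assumptions made for illustration.

```python
from fractions import Fraction

def fs_edge_probabilities(adj, L):
    """Exact probability that one FS step samples each (walker index, u, v)."""
    total = sum(len(adj[u]) for u in L)
    probs = {}
    for i, u in enumerate(L):
        pick_walker = Fraction(len(adj[u]), total)   # line 4 of Algorithm 1
        for v in adj[u]:
            pick_edge = Fraction(1, len(adj[u]))     # line 5 of Algorithm 1
            probs[(i, u, v)] = pick_walker * pick_edge
    return probs

# Hypothetical toy graph: triangle a-c-d with a pendant vertex b attached to a.
adj = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a", "d"], "d": ["a", "c"]}
probs = fs_edge_probabilities(adj, L=["a", "b"])
```

For the state L = (a, b) the edge frontier has deg(a) + deg(b) = 4 edges, and each of them is sampled with probability 1/4, which is the value $1/|e(L_n)|$ used in the proof.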

Theorem 5.2. Recall that G is a directed symmetric graph. If G is connected and non-bipartite, then in steady state FS has the following properties:

(I) edges are sampled uniformly at random and form a stationary sequence,

(II) the state $L_\infty = (v_1, \dots, v_m)$ has distribution

$$P[L_\infty = (v_1, \dots, v_m)] = \frac{\sum_{i=1}^{m} \deg(v_i)}{m\,|V|^{m-1}\,\mathrm{vol}(V)},$$

which is unique, and

(III) the sequence of sampled edges satisfies the Strong Law of Large Numbers (Theorem 4.1).

Proof. The proof is found in Appendix A. □
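Property (II) can be verified by brute force on a small graph: build the distribution from the formula and push it through one FS transition. The sketch below does this exactly for m = 2; the toy graph and the helper names `fs_stationary` and `fs_step` are assumptions made for this illustration. It checks stationarity only, not uniqueness.

```python
from fractions import Fraction
from itertools import product

def fs_stationary(adj, m):
    """Distribution of Theorem 5.2 (II) over all states in V^m."""
    n = len(adj)
    vol = sum(len(nb) for nb in adj.values())
    norm = m * n ** (m - 1) * vol
    return {L: Fraction(sum(len(adj[v]) for v in L), norm)
            for L in product(adj, repeat=m)}

def fs_step(adj, dist):
    """Push a distribution over FS states through one FS transition."""
    out = {L: Fraction(0) for L in dist}
    for L, p in dist.items():
        frontier = sum(len(adj[v]) for v in L)      # |e(L)|
        for i, u in enumerate(L):
            for v in adj[u]:
                nxt = L[:i] + (v,) + L[i + 1:]      # walker i moves from u to v
                out[nxt] += p / frontier            # each frontier edge has prob 1/|e(L)|
    return out

# Hypothetical toy graph: connected, symmetric, non-bipartite (contains a triangle).
adj = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a", "d"], "d": ["a", "c"]}
pi = fs_stationary(adj, m=2)
unchanged = fs_step(adj, pi) == pi
```

Here vol(V) = 8 and |V| = 4, so the normalizing constant is 2 · 4 · 8 = 64 and, for instance, the state (a, a) has probability 6/64.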

In Section [53] we observed that, when starting multiple RWs, the MSE is reduced when the number of walkers inside each subgraph matches the number obtained when the graph is connected and all walkers are in steady state. In what follows we see that, in steady state, the average number of MultipleRW walkers in $V_A$ is far from the average number obtained with m uniformly sampled vertices. In contrast, Section 5.2 shows that as $m \to \infty$, by uniformly sampling the starting vertices, FS starts in steady state with respect to the number of random walkers in any subset of vertices $V_A \subset V$.

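5.1 MultipleRW Steady State vs. Uniform Distribution

Consider a MultipleRW process with m walkers and let $K_{mw}(m)$ be a random variable that denotes the steady state number of MultipleRW random walkers in $V_A$. Let $\alpha_A = E[K_{mw}(m)] / E[K_{un}(m)]$ be the ratio between the steady state number of MultipleRW walkers in $V_A$ and the number of random walkers that start in $V_A$ from uniformly sampled vertices. As all random walkers are independent, and in steady state each walker is in $V_A$ with probability $\mathrm{vol}(V_A)/\mathrm{vol}(V)$, we have

$$E[K_{mw}(m)] = \frac{m\,\mathrm{vol}(V_A)}{\mathrm{vol}(V)}.$$

It is also easy to see that

$$E[K_{un}(m)] = \frac{m\,|V_A|}{|V|}.$$

From the above we have

$$\alpha_A = E[K_{mw}(m)] / E[K_{un}(m)] = d_A / d,$$

where $d_A = \mathrm{vol}(V_A)/|V_A|$ and $d = \mathrm{vol}(V)/|V|$ are the average degrees of the vertices in $V_A$ and V, respectively. Note that the value of $\alpha_A$ may be quite large or close to zero depending on both (1) the choice of $V_A$ and (2) the average degree of G.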
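To make the ratio concrete, the following sketch computes $\alpha_A = d_A/d$ for two choices of $V_A$ on a small star graph. The graph and the function name `alpha_A` are assumptions made for illustration.

```python
def alpha_A(adj, V_A):
    """Ratio E[K_mw(m)] / E[K_un(m)] = d_A / d for a vertex subset V_A."""
    vol_A = sum(len(adj[v]) for v in V_A)
    vol = sum(len(nb) for nb in adj.values())
    d_A = vol_A / len(V_A)      # average degree inside V_A
    d = vol / len(adj)          # average degree of the whole graph
    return d_A / d

# Hypothetical star graph: hub "h" connected to four leaves.
adj = {"h": [1, 2, 3, 4], 1: ["h"], 2: ["h"], 3: ["h"], 4: ["h"]}
ratio_hub = alpha_A(adj, ["h"])             # d_A = 4, d = 8/5
ratio_leaves = alpha_A(adj, [1, 2, 3, 4])   # d_A = 1, d = 8/5
```

On this graph steady-state MultipleRW places 2.5 times as many walkers on the hub as uniform seeding does, and only 0.625 times as many on the leaves, which illustrates how far $\alpha_A$ can move away from one.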

5.2 FS Steady State vs. Uniform Distribution

Let $G = (V, E)$ be a connected graph and $V_A \subset V$ be a proper subset of V; define $V_B = V \setminus V_A$. Let $d_A = \mathrm{vol}(V_A)/|V_A|$, $d_B = \mathrm{vol}(V_B)/|V_B|$, and $d = \mathrm{vol}(V)/|V|$ be the average degrees of the vertices in $V_A$, $V_B$, and V, respectively. Consider a FS process with m walkers and let $K_{fs}(m)$ be a random variable that denotes the number of random walkers in $V_A$ in steady state. Let $K_{un}(m)$ be a random variable that denotes the number of sampled vertices, out of m uniformly (randomly) sampled vertices from V, that belong to $V_A$. $K_{un}(m)$ has distribution $P[K_{un}(m) = k] = \binom{m}{k} p^k (1-p)^{m-k}$, $\forall k \geq 0$, where $p = |V_A|/|V|$. In this section we show that $K_{fs}(m)$ and $K_{un}(m)$ converge to the same limiting distribution, i.e.,

$$\lim_{m \to \infty} P[K_{fs}(m) = k] = \lim_{m \to \infty} P[K_{un}(m) = k], \quad \forall k \geq 0. \qquad (9)$$

Recall that the FS algorithm starts m random walkers at m uniformly sampled vertices (sampled independently). Let $V_A \subset V$. As m increases, eq. (9) implies that the number of FS walkers that are initially selected to be in $V_A$ approaches the steady state distribution (assuming G is connected).

Let $L \in V^m$ be the state of FS; from Theorem 5.2 we have

$$P[L = (v_1, \dots, v_m)] = \frac{\sum_{i=1}^{m} \deg(v_i)}{m\,|V|^{m-1}\,\mathrm{vol}(V)}.$$

In the following lemma we find the probability that $K_{fs}(m) = k$, $0 \leq k \leq m$.

Lemma 5.3.

$$P[K_{fs}(m) = k] = \frac{1}{md} \binom{m}{k} p^k (1-p)^{m-k} \left(k\,d_A + (m-k)\,d_B\right),$$

where $p = |V_A|/|V|$ and $0 \leq k \leq m$.

Proof. $P[K_{fs}(m) = k]$ is the sum of the probabilities $P[L = (v_1, \dots, v_m)]$ over all states L in which exactly k vertices belong to $V_A$. Consider $v_i$, the i-th element of L. When $v_i \in V_A$, the contribution of $v_i$ towards $P[K_{fs}(m) = k]$ is

$$|V_A|^{k-1} |V_B|^{m-k} \deg(v_i) / (m\,|V|^{m-1}\,\mathrm{vol}(V));$$

when $v_i \in V_B$, the contribution of $v_i$ towards $P[K_{fs}(m) = k]$ is

$$|V_A|^{k} |V_B|^{m-k-1} \deg(v_i) / (m\,|V|^{m-1}\,\mathrm{vol}(V)).$$


Summing over all elements in L and over all vertices yields

$$P[K_{fs}(m) = k] = \binom{m}{k} \frac{|V_A|^{k} |V_B|^{m-k}}{m\,|V|^{m-1}\,\mathrm{vol}(V)} \left( \sum_{i=1}^{k} \sum_{v \in V_A} \frac{\deg(v)}{|V_A|} + \sum_{i=k+1}^{m} \sum_{v \in V_B} \frac{\deg(v)}{|V_B|} \right) = \frac{1}{md} \binom{m}{k} p^k (1-p)^{m-k} \left(k\,d_A + (m-k)\,d_B\right). \qquad (10)$$

□

                      Flickr       LiveJournal   YouTube      Internet RLT
Description           Social Net.  Social Net.   Social Net.  Internet tracert.
Type of graph         Directed     Directed      Directed     Directed
# of Vertices         1,715,255    5,204,176     1,138,499    192,244
Size of LCC           1,624,992    5,189,809     1,134,890    609,066
# of Edges            22,613,981   77,402,652    9,890,764    609,066
Average Degree        12.2         14.6          8.7          3.2
w_max                 2232         1029          3305         335
% of Original Graph   26.9%        95.4%         NA           NA

Table 1: Summary of the graph datasets used in our simulations. "Size of LCC" refers to the size of the largest connected component, and w_max is the value of the largest vertex degree divided by the average degree.

The previous lemma gives the probability that a subset of vertices $V_A$ has $K_{fs}(m) \in \{0, \dots, m\}$ FS random walkers. The following theorem shows that $K_{fs}(m)$ and $K_{un}(m)$ converge to the same limiting distribution.

Theorem 5.4.

$$\lim_{m \to \infty} P[K_{fs}(m) = k] = \lim_{m \to \infty} P[K_{un}(m) = k], \quad \forall k \geq 0.$$

Proof. From Lemma 5.3,

$$P[K_{fs}(m) = k] = \frac{k\,d_A + (m-k)\,d_B}{md}\, P[K_{un}(m) = k]. \qquad (11)$$

Note that if $k = mp$ we have

$$\frac{mp\,d_A + (m - mp)\,d_B}{md} = \frac{m\,\mathrm{vol}(V_A)/|V| + m\,\mathrm{vol}(V_B)/|V|}{md} = 1. \qquad (12)$$

As $m \to \infty$, the probability mass of $P[K_{un}(m) = k]$ gets highly concentrated around the interval $k \in [mp - c\sqrt{m},\, mp + c\sqrt{m}]$, for large enough values of c. Let $k^-(m) = mp - z(m)\sqrt{mp(1-p)}$ and $k^+(m) = mp + z(m)\sqrt{mp(1-p)}$, where $z(m) = o(\sqrt[6]{m})$ [1] is a slowly increasing function of m. Note that

$$\frac{(k^+(m) - mp)(d_A - d_B)}{md} = \frac{z(m)\sqrt{m}\sqrt{p(1-p)}\,(d_A - d_B)}{md} = \frac{o(m^{2/3})}{m}$$

and, thus, eq. (12) yields

$$\lim_{m \to \infty} \frac{k^-(m)\,d_A + (m - k^-(m))\,d_B}{md} = 1 \qquad (13)$$

and

$$\lim_{m \to \infty} \frac{k^+(m)\,d_A + (m - k^+(m))\,d_B}{md} = 1. \qquad (14)$$

[1] $f(m) = o(h(m))$ implies $\lim_{m \to \infty} f(m)/h(m) = 0$.

All that is left to show is that $\lim_{m \to \infty} P[K_{un}(m) < k^-(m)] = 0$ and $\lim_{m \to \infty} P[K_{un}(m) > k^+(m)] = 0$. Using an extension of the de Moivre-Laplace limit theorem shown in [12, pg. 193] yields

$$\lim_{m \to \infty} P[K_{un}(m) < k^-(m)] = 0, \quad \text{and} \quad \lim_{m \to \infty} P[K_{un}(m) > k^+(m)] = 0. \qquad (15)$$

Putting together eqs. (11), (13), (14), and (15), with

$$\lim_{m \to \infty} \frac{k(m)\,d_A + (m - k(m))\,d_B}{md} < \infty, \quad k(m) = o(m),$$

yields

$$\lim_{m \to \infty} P[K_{fs}(m) = k] = \lim_{m \to \infty} P[K_{un}(m) = k], \quad \forall k \geq 0,$$

which concludes our proof. □
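The convergence can be observed numerically by evaluating both probability mass functions from Lemma 5.3 and the binomial distribution for growing m. The sketch below summarizes their difference with the total variation distance, which is a choice made here for illustration (the theorem itself is stated pointwise in k); the parameter values p = 0.5, d_A = 2, d_B = 10 are assumptions loosely inspired by the two-component example used later in the paper.

```python
from math import comb

def pmf_un(m, k, p):
    """P[K_un(m) = k]: binomial with success probability p = |V_A| / |V|."""
    return comb(m, k) * p**k * (1 - p)**(m - k)

def pmf_fs(m, k, p, d_A, d_B):
    """P[K_fs(m) = k] from Lemma 5.3."""
    d = p * d_A + (1 - p) * d_B     # average degree of G, since vol(V) = vol(V_A) + vol(V_B)
    return (k * d_A + (m - k) * d_B) / (m * d) * pmf_un(m, k, p)

def total_variation(m, p, d_A, d_B):
    """Half the L1 distance between the two distributions over k = 0..m."""
    return 0.5 * sum(abs(pmf_fs(m, k, p, d_A, d_B) - pmf_un(m, k, p))
                     for k in range(m + 1))

tvs = [total_variation(m, p=0.5, d_A=2, d_B=10) for m in (10, 100, 1000)]
```

The distance shrinks as m grows, consistent with the multiplicative factor in eq. (11) approaching one on the region where the binomial mass concentrates.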

We have seen that, as m gets larger, FS gets closer to starting in steady state with respect to the number of FS random walkers inside $V_A$, $\forall V_A \subset V$.

We have seen that if we initialize m random walkers with uniformly sampled vertices, FS starts closer to steady state than MultipleRW. In what follows we show that FS is well suited to be used in large scale (parallel, asynchronous) experiments without incurring any coordination or communication costs between the random walkers.

5.3 Distributed FS 

FS is well suited to be used in large scale (parallel, asynchronous) experiments. Let B be the budget of FS. In the distributed version of FS the budget is not directly related to the number of sampled vertices obtained by the algorithm. This is because distributed FS is achieved using multiple independent random walkers where the cost of sampling a vertex v is an exponentially distributed random variable with parameter deg(v). In what follows we show, using the Uniformization principle of Markov chains [8, Chapter 7.5] and the Poisson decomposition property, that FS can be made fully distributed.



Theorem 5.5. A MultipleRW sampling process where the cost of sampling a vertex v is an exponentially distributed random variable with parameter deg(v) is equivalent to a FS process.

Proof. Consider the following Distributed FS (DFS) process. Let $X = \{L(\tau) \in V^m ;\ \tau \in \mathbb{R}^+\}$ be the Markov chain associated with a random walker over $G^m = (V^m, E_m)$, the m-th Cartesian power of G, with transition rate matrix

$$Q = A - D,$$

where A is the adjacency matrix of $G^m$, $A_{i,j} \in \{0, 1\}$, $\forall i, j$, and D is a diagonal matrix with $D_{i,i} = \sum_j A_{i,j}$. We observe this process over the interval [0, B].

In the DFS process, the probability that the k-th random walker transitions out of vertex $v_k$ at time $\tau + \Delta$ depends only on $\deg(v_k)$ and not on the state of $L(\tau)$. Thus, we can decompose the Poisson process describing a departure from the state $L = (v_1, \dots, v_m)$ into m independent stochastic processes, where the i-th process is a Poisson process with parameter $\deg(v_i)$, $i = 1, \dots, m$. The above is equivalent to the stochastic process of a MultipleRW process with m random walkers and budget B, where the cost of sampling a vertex v is an exponentially distributed random variable with rate deg(v).

The DFS is equivalent to a FS process via the Uniformization property of Markov chains [8, Chapter 7.5]. The transition probability matrix of the Uniformized Markov chain (with unitary uniformization parameter) at the embedded transition points is

$$P = D^{-1} A,$$

which is also the transition probability matrix of a FS process. □
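The construction in the proof can be simulated as a set of independent walkers, each waiting an exponentially distributed time with rate deg(v) at its current vertex v before moving to a uniformly chosen neighbor. The sketch below is an illustration under assumptions: the name `distributed_fs` and the adjacency-dict format are hypothetical, and the priority queue is only a device for merging the walkers' event streams inside one process; in a real deployment each walker would run on its own without consulting the others.

```python
import heapq
import random

def distributed_fs(adj, m, budget, rng=None):
    """Simulate m independent walkers with Exp(deg(v)) holding times over [0, budget].

    Returns the time-ordered list of (time, u, v) moves.
    """
    rng = rng or random.Random()
    vertices = list(adj)
    events = []  # heap of (time of next move, walker id, current vertex)
    for w in range(m):
        v = rng.choice(vertices)                      # uniform start vertex
        heapq.heappush(events, (rng.expovariate(len(adj[v])), w, v))
    moves = []
    while True:
        t, w, u = heapq.heappop(events)
        if t > budget:                                # every remaining event is later still
            break
        v = rng.choice(adj[u])                        # uniform outgoing edge of u
        moves.append((t, u, v))
        # Schedule this walker's next move; no other walker's state is consulted.
        heapq.heappush(events, (t + rng.expovariate(len(adj[v])), w, v))
    return moves

# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
moves = distributed_fs(adj, m=3, budget=5.0, rng=random.Random(7))
```

Because the minimum of independent exponential clocks with rates deg(v_1), ..., deg(v_m) fires at walker i with probability proportional to deg(v_i), the merged sequence of moves selects walkers exactly as line 4 of Algorithm 1 does. Note that the number of moves within the budget is random, as the text points out.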

6. RESULTS 

In this section we compare FS with SingleRW and MultipleRW. We also contrast FS with random vertex and edge sampling. The experiments consist of executing these sampling methods on a variety of real world graphs. The datasets used in the simulations are summarized in Table 1. "Flickr", "Livejournal", and "YouTube" are popular photosharing, blog (weblog), and video sharing websites, respectively. Users are represented as vertices of a graph. In these websites a user can subscribe to other user updates; an edge (u, v) exists between users u and v if user u subscribes to user v. At
"Livejournal" and "YouTube" it is possible to query the in- 
coming and outgoing edges of a given user. Further details of 
these three datasets can be found in . "Internet RLT" is 
a router-level Internet graph collected from traceroute measurements of 23 monitors distributed over the world [13].
Note that some of these graphs contain disconnected com- 
ponents (subgraphs). 

In the following simulations the starting vertex of each random walker is chosen uniformly at random from the set of all vertices. Our results show that FS estimates are consistently more accurate than their SingleRW and MultipleRW counterparts. Moreover, when restricted to the largest connected component, FS reaches steady state faster than SingleRW and MultipleRW in the simulations presented in Appendix B.

6.1 Assortative Mixing Coefficient 

In our first experiment we treat the graphs in Table 1 as undirected graphs. In-degrees and out-degrees are represented as vertex labels and the assortative mixing coefficient is obtained using the estimator described in Section 4.2.2.

In our experiment we average the estimates and calculate their mean squared error (MSE) over 100 runs. The sampling budget is |V|/100 for all graphs. Let $\hat{r}$ denote the estimated value of r. Table 2 shows a summary of the relative bias of $\hat{r}$, $(1 - E[\hat{r}]/r)$, and $\hat{r}$'s NMSE with respect to the true value of r. We observe that FS is consistently more accurate than both MultipleRW and SingleRW. If we focus on Flickr, the FS bias is 7-fold smaller than the bias of both MultipleRW and SingleRW. In addition, FS's NMSE is one order of magnitude smaller than the NMSEs of MultipleRW and SingleRW. The Internet graph ("Internet RLT") is the only graph we studied that shows little difference between FS and MultipleRW.
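The two summary statistics can be computed from a list of per-run estimates as sketched below. The relative bias follows the expression given above. The NMSE formula is an assumption here: it is taken to be the root mean squared error normalized by the true value, and the absolute value in the denominator is an added choice so that the result stays non-negative when r is negative. The sample numbers are made up for illustration.

```python
from math import sqrt

def relative_bias(estimates, true_value):
    """1 - E[r_hat] / r, with the expectation replaced by the mean over runs."""
    mean = sum(estimates) / len(estimates)
    return 1 - mean / true_value

def nmse(estimates, true_value):
    """Assumed definition: sqrt(E[(r_hat - r)^2]) / |r|."""
    mse = sum((x - true_value) ** 2 for x in estimates) / len(estimates)
    return sqrt(mse) / abs(true_value)

# Hypothetical estimates from four runs, true value r = 0.20.
runs = [0.18, 0.22, 0.20, 0.24]
bias = relative_bias(runs, 0.20)
error = nmse(runs, 0.20)
```

A negative relative bias, as in this toy example, means the method overestimates r on average.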

We also perform an extreme experiment that focuses on the impact of loosely connected components on the assortative mixing estimates. Consider a graph that consists of two instances of a random undirected Barabási-Albert [5] graph, $G_A$ and $G_B$, with 5 x 10^ vertices each and average degrees 2 and 10, respectively, joined by a single edge connecting the two smallest degree vertices in $G_A$ and $G_B$ (ties are resolved arbitrarily). Henceforth, we use $G_{AB}$ to denote the above graph. It is worth noting that over the $G_{AB}$ graph, SingleRW consistently finds $\hat{r} = 0$ over all 100 runs. This is because SingleRW only estimated the assortative mixing of either subgraph A or subgraph B, which are both zero. Over $G_{AB}$ MultipleRW performs almost as badly as SingleRW while FS is able to accurately estimate r.

6.2 In- and Out-degree Distribution Estimates 

We now focus on estimating the in-degree distribution. Let $\theta = \{\theta_i\}_{\forall i \in \mathcal{L}}$ denote the in-degree distribution, where $\theta_i$ is the fraction of vertices with in-degree i. In our simulations we estimate $\gamma_i = \sum_{k=i+1}^{\infty} \theta_k$, the CCDF of $\theta$, using equation 0. We choose to estimate the CCDF instead of the density because the CCDF is the plot of choice when it comes to displaying degree distributions. Each simulation consists of 10,000 runs (sample paths) used to compute the empirical CNMSE (equation Q). The CNMSE is used to compare the accuracy of the estimates obtained from FS (dimension $m \in \{10, 1000\}$), SingleRW, and MultipleRW ($m \in \{10, 1000\}$ walkers). For the sake of conciseness, we restrict our presentation to a handful of representative results.
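The CCDF defined above is a running tail sum of the density, which the following sketch computes in a single backward pass. The function name `ccdf` and the list representation (index i holds the fraction of vertices with in-degree i) are assumptions made for the example.

```python
def ccdf(theta):
    """Return gamma with gamma[i] = sum of theta[k] over all k > i."""
    gamma = [0.0] * len(theta)
    tail = 0.0
    for i in range(len(theta) - 1, -1, -1):
        gamma[i] = tail          # mass strictly above in-degree i
        tail += theta[i]
    return gamma

# Hypothetical in-degree distribution over degrees 0..3.
theta = [0.10, 0.50, 0.25, 0.15]
gamma = ccdf(theta)
```

The same pass applies unchanged to an estimated density, so a CCDF estimate can be obtained from any per-degree estimate of $\theta$.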
Figure 3: (Flickr) Log-log plot of the in-degree CCDF.

Consider first two representative results from the Flickr graph, whose in-degree CCDF (complementary cumulative distribution function) log-log plot is shown in Figure 3. The sampling budget is B = 17,152 = |V|/100, which amounts to sampling 1% of the vertices. In the first simulation, we are restricted to the Largest Connected Component (LCC) (which contains 94% of the vertices). The objective is to test if FS can outperform SingleRW and MultipleRW even when there are no disconnected components. Figure 4 shows a log-log plot of the CNMSE of FS (m = 1000), SingleRW, and MultipleRW (m = 1000). In this experiment FS outperforms both SingleRW and MultipleRW. It is interesting to note that the estimates obtained by SingleRW are more accurate than the estimates obtained by MultipleRW. Now consider the complete Flickr graph. Figure 5 shows a log-log plot of the CNMSE of the in-degree distribution estimates. Contrasting the plots shown in Figures 4 and 5 we see that the gap between FS and both SingleRW and MultipleRW has significantly increased, favoring FS.

Figure 4: (LCC of Flickr) The log-log plot of the CNMSE of the in-degree distribution estimates with budget B = |V|/100.

Figure 5: (Flickr) The log-log plot of the CNMSE of the in-degree distribution estimates with budget B = |V|/100.

To better understand the differences between these sampling methods, Figure 6 focuses on four runs (sample paths) of the simulation over the complete Flickr graph. Figure 6 plots the evolution of $\hat{\theta}_1$ (the estimate of $\theta_1$) as a function of n (the number of steps in the random walk). At each run of the simulator both FS and MultipleRW start at the same initial set of vertices (chosen using random vertex sampling). Figure 6 shows that all four FS sample paths (runs) quickly converge to the value of $\theta_1$. For SingleRW, three of the four runs start inside the LCC. These runs do not converge to the value of $\theta_1$ as some vertices with in-degree one lie outside the LCC. In one of the runs, SingleRW starts in a small disconnected component and, thus, grossly overestimates the value of $\theta_1$. For a similar reason, i.e., walkers starting at small disconnected components, MultipleRW grossly overestimates the value of $\theta_1$. The rapid increase of the MultipleRW estimate $\hat{\theta}_1$ at around n = 10^ steps needs further investigation. It may be due to the transient of the random walk (discussed in Section 4.4). Even when $n \gg 1$ (not shown in Figure 6) the MultipleRW estimate is unable to converge to $\theta_1$. Modifying both SingleRW and MultipleRW methods to cope with disconnected components is an interesting open problem.

Figure 6: (LCC of Flickr) Four sample paths of $\hat{\theta}_1$ ($\theta_1 = 0.53$) as a function of the number of steps n (horizontal axis in log scale).

Figure 7: (Livejournal) Log-log plot of the out-degree CCDF.

For the sake of conciseness, we omit the results of the simulations over the remaining graphs (Table 1) as they are similar to the results observed over the Flickr graph. However, consider the out-degree distribution estimates of Livejournal. Figure 7 shows a log-log plot of the CCDF of the out-degrees. The log-log plot of the CNMSE is shown in Figure 8 for FS (m = 100), SingleRW, and MultipleRW (m = 100) with sampling budget B = |V|/100. From Figure 8 we see that the FS estimates for small out-degrees are up to one order of magnitude more accurate than those obtained from both SingleRW and MultipleRW.

Figure 8: (Livejournal) The log-log plot of the CNMSE of the out-degree distribution estimation with sampling budget B = |V|/100 (CNMSE over 10,000 runs).

Figure 9: ($G_{AB}$ graph) Four paths of $\hat{\theta}_{10}$ as a function of the number of steps n ($\theta_{10} = 0.024$).

The next experiment focuses on studying the impact of loosely connected components on the degree distribution estimates. For this we use the two joined Barabási-Albert graphs $G_{AB}$ presented in Section 4.2.2. The experiment consists of estimating the degree distribution of $G_{AB}$ using FS (m = 100), SingleRW, and MultipleRW (m = 100). Again, both FS and MultipleRW start at the same initial set of vertices in each simulation (chosen uniformly at random). In this experiment the hypothesis is that, for small sampling budgets, each random walker will see the degree distribution of either $G_A$ or $G_B$ but not the degree distribution of $G_{AB}$. Moreover, as the starting vertex of each random walker is chosen uniformly at random, $G_A$, which has the same number of vertices as $G_B$ but 1/5 of the edges, receives more random walkers than its per edge "share". Consequently, MultipleRW oversamples $G_A$.

Figure 9 shows the results of four simulation runs and plots the evolution of the estimates of $\theta_{10}$ ($\hat{\theta}_{10}$) as a function of the number of steps. In this simulation note that: (1) FS quickly converges to a value that is close to the correct value; (2) two out of the four SingleRW runs overestimate $\theta_{10}$ and the remaining two underestimate it; (3) three out of the four MultipleRW runs converge to the same, incorrect, fraction (underestimating the true value of $\theta_{10}$). FS is designed to be robust to disconnected or loosely connected components. All of the FS runs quickly converge to a good estimate of $\theta_{10}$. Figure 10 also shows that the CNMSE for FS is consistently lower than the CNMSE for SingleRW and MultipleRW.

Figure 10: ($G_{AB}$ graph) The log-log plot of the CNMSE of the degree distribution estimation with sampling budget B = |V|/100 (CNMSE over 10,000 runs).

6.3 FS vs. Stationary MultipleRW & SingleRW

We now compare FS with SingleRW and MultipleRW, when the latter two start in steady state. Figure 11 shows the results (over the Flickr graph) of the same simulation scenario used to obtain the results in Figure 5 except that now MultipleRW and SingleRW both start in steady state. While SingleRW has improved slightly (most notably at the tail errors), the benefit of starting in steady state is most felt by the MultipleRW method. In this simulation we see that the large estimation errors of MultipleRW in the previous simulations were due to the starting vertices being sampled uniformly at random. It is interesting to observe that MultipleRW starting in steady state and FS have similar estimation errors.

Figure 11: (Flickr) The log-log plot of the CNMSE of the in-degree distribution estimation of MultipleRW and SingleRW starting in steady state; sampling budget B = |V|/100 (NMSE over 10,000 runs).

6.4 FS vs. Random Independent Sampling

In Section 3 we showed that, if the degrees of two neighboring vertices are independent, random edge sampling is more accurate than random vertex sampling when it comes to estimating the tail of the degree distribution. In this section we observe this to be true over large real world graphs; we also observe that the accuracy of FS closely matches the accuracy of random edge sampling. In the following simulations we estimate the in-degree distribution. Random edge sampling uses the estimator $\hat{\theta}_i$, equation (7) (the estimator used for sampled vertices is trivial).

In our first simulation we set the sampling cost of random vertex sampling to one and random edge sampling has cost two (as each edge samples two vertices). The sampling budget is B = |V|/100. We label this simulation "100% hit ratio" to indicate the unitary cost of randomly sampling vertices. Figure 12 shows a log-log plot of the NMSE (not the CNMSE) of our simulation over the (complete) Flickr graph. Here we use the NMSE (instead of the CNMSE) in order to be able to compare our results with the ones presented in the equations of our model in Section 3. The vertical line indicates the average in-degree. Note that random edge sampling is more (less) accurate than random vertex sampling at estimating in-degrees larger (smaller) than the average in-degree, as predicted by our model in Section 3. We also observe that the accuracy of FS (m = 1000) closely matches the accuracy of random edge sampling.

Figure 12: (Flickr) The log-log plot shows the NMSE of the in-degree distribution estimation with budget B = |V|/100 = 18612 (CNMSE over 10,000 runs).

Figure 13: (Livejournal) The log-log plot shows the CNMSE of the in-degree distribution estimation with budget B = |V|/100 = 52844 (CNMSE over 10,000 runs).

Some complex networks exhibit a sparse user-id space.
In this scenario a fraction of the sampling budget B can
be spent querying invalid user-ids. Motivated by recent
experiments over the MySpace network ^SCij, the following 
experiment assumes that only 10% of the user-ids are valid, 



i.e., on average only one in every ten randomly sampled vertices is valid. We denote this value (10%) to be the hit ratio. For random edge sampling we assume a hit ratio of 1% (the choice of 1% is arbitrary). Figure 13 shows a log-log plot of the CNMSE of our simulation over the (complete) Livejournal graph with sampling budget B = |V|/100 = 52844. We observe that FS (m = 1000), which samples m = 1000 random vertices and (on average) crawls B - 10m vertices, outperforms random edge sampling. Also note that FS estimates are more accurate than the estimates obtained from random vertex sampling for all but the three smallest in-degrees. This indicates that FS is more robust to low hit ratios than random vertex and edge sampling.
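The budget split in this experiment is simple arithmetic, sketched below. The function name `fs_walk_steps` and the parameter `queries_per_valid_vertex` (the reciprocal of the hit ratio, i.e. the average cost c of one valid uniform vertex) are names introduced here for illustration.

```python
def fs_walk_steps(budget, m, queries_per_valid_vertex):
    """Average number of walk steps left after paying for m valid start vertices."""
    return budget - m * queries_per_valid_vertex

# Values from the Livejournal experiment: B = 52844, m = 1000, hit ratio 10%.
steps = fs_walk_steps(52844, 1000, 10)

# For comparison, random vertex sampling with the same budget and hit ratio
# obtains on average budget / 10 valid vertices.
valid_vertices = 52844 / 10
```

Under these values FS spends about 10,000 units of budget on its 1000 start vertices and still crawls 42,844 vertices on average, whereas random vertex sampling ends up with roughly 5,284 valid samples.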

Figure 14: (Flickr) The NMSE of the density estimates of the most popular groups in the Flickr graph.

6.5 Density of Special Interest Groups 

In a variety of complex networks, e.g., on-line social networks, each vertex (user) is associated with multiple labels that represent group affiliations, e.g., user interests and user geolocation, among others. For example, in the Flickr graph 21% of the users belong to one or more special interest groups [26]. Let $\mathcal{L}$ denote the set of groups in the Flickr graph and $\theta_l$ denote the fraction of vertices that belong to group $l \in \mathcal{L}$. In the simulations we estimate $\theta_l$ using FS (m = 100), SingleRW, and MultipleRW (m = 100) with budget B = |V|/100. Figure 14 shows the NMSE (from 10,000 runs) of the most popular 200 groups ordered in decreasing popularity. FS is clearly superior to both SingleRW and MultipleRW. Even when restricting the random walks to the largest connected component, FS still noticeably outperforms MultipleRW (m = 100) and SingleRW.

Graph         B (% of |V|)   C      FS            SingleRW      MultipleRW
Flickr        1%             0.14   0.13 (0.04)   0.12 (0.33)   0.16 (0.18)
Livejournal   1%             0.16   0.16 (0.02)   0.16 (0.02)   0.17 (0.06)

Table 3: Global clustering coefficient estimates. C is the true value of the global clustering coefficient and $\hat{C}$ is its estimated value.

6.6 Global Clustering Coefficient Estimates 

In our last set of experiments we evaluate the accuracy of estimating the global clustering coefficient using FS, SingleRW, and MultipleRW. Our simulations show a small difference between FS (m = 1000), SingleRW, and MultipleRW (m = 1000). Let C be the clustering coefficient and $\hat{C}$ denote its estimated value. Table 3 presents the empirical value of $E[\hat{C}]$ and the empirical NMSE of the clustering coefficient, given by

$$\frac{\sqrt{E[(\hat{C} - C)^2]}}{C},$$

over 10,000 runs of FS, SingleRW, and MultipleRW over the Flickr and Livejournal graphs. From the results of Table 3 we see that FS accurately estimates the global clustering coefficient and has smaller error than both SingleRW and MultipleRW.

7. RELATED WORK 

This section reviews the related literature. FS can be classified as a Markov Chain Monte Carlo (MCMC) method. Other MCMC-based methods have been applied to characterize complex networks. Applications include, but are not limited to, estimating characteristics of a population [35] (e.g., estimation of HIV seroprevalence among drug users [24]), content density in peer-to-peer networks [16, 23, 29, 34], uniformly sampling Web pages from the Internet [17, 32], and uniformly sampling Web pages from a search engine's index [J. The above literature is mostly concerned with random walks that seek to sample vertices uniformly (also known as Metropolized Random Walks or Metropolis-RW) 16, 13 ism EH. The accuracy of RW and Metropolis-RW (MRW) is compared in [15, 29], and in a variety of experiments RW estimates are shown to be consistently more accurate than or equal to MRW estimates.

The above literature does not consider the use of multiple random walks to address the problem of estimating characteristics of disconnected or loosely connected graphs. While multiple independent random walkers have been used as a convergence test in the literature, our simulations in Section 6 show that independent walkers are not suited to sample loosely connected graphs when the starting vertices are selected uniformly at random.

A number of real complex networks are known to have disconnected or loosely connected components. A large body of MCMC literature is dedicated to overcoming the locality problem described in Section 4.3. However, the literature either assumes that the graph is very structured, e.g., a 2-dimensional lattice, or that the graph is completely known. These assumptions make the solutions inapplicable to our problem. A comprehensive list of MCMC methods and their characteristics can be found in [31].

Projecting a RW onto a higher dimensional space has been used in [9] to make the Markov chain associated to the random walker nonreversible, which can speed up the mixing of the original RW. Unfortunately, it is unclear if this method can be successfully used to estimate characteristics of complex networks.

In networks that cannot be crawled (e.g., the Internet 
topology), samples must be obtained along shortest paths, 
and vertex degrees cannot be queried, iJj shows that ob- 
served vertex degrees are biased. Our work, however, as- 
sumes a graph can be crawled and vertex degrees queried. 
Our scenario admits a RW with an unbiased estimator. Mul- 
tiple random walks also find other applications besides the 
one presented in this work. They are used to collect Web 
data [To], search P2P networks [6] [32], and decrease the time 



to discover "new wireless nodes" [5]. Dependent multiple 
random walks are also used in percolation theory 0. 

8. DISCUSSION AND FUTURE WORK 

In this work we presented a new and promising random walk-based method (Frontier sampling) that mitigates the estimation errors caused by subgraphs that "trap" a random walker. Frontier sampling (FS) uses multiple (m) mutually dependent random walkers starting from vertices sampled uniformly at random. The FS samples are shown to be the projection (onto the original graph) of a special type of m-dimensional (single) random walker. Simulations over real world graphs in Section 6 show that FS is more robust than single and multiple independent random walkers (starting out of steady state) at estimating in-degree distributions and the fraction of users that belong to a social group. We also present evidence, using an analytical argument (also substantiated by simulations), that random walks (in particular, FS) are better suited to estimate the tail (all degrees greater than the average) of degree distributions than random vertex sampling. FS can also be made fully distributed without incurring any coordination or communication costs.

The ideas behind FS can have far reaching implications, 
from estimating characteristics of dynamic networks to the 
design of new MCMC-based approximation algorithms. 
APPENDIX

A. PROOF OF THE FRONTIER SAMPLING 
THEOREM 

In what follows we restate Theorem 15.21 and present a 
proof. 

Theorem. Recall that G is a directed symmetric graph. If 
G is connected and non-bipartite, then in steady state FS 
has the following properties: 

(I) edges are sampled uniformly at random and form a sta- 
tionary sequence, 

(II) the steady state $L_\infty = (v_1, \dots, v_m)$ has distribution 

$$P[L_\infty = (v_1, \dots, v_m)] = \frac{\sum_{i=1}^{m} \deg(v_i)}{m\,|V|^{m-1}\,\mathrm{vol}(V)},$$ 

which is unique, and 

(III) the sequence of sampled edges satisfies the Strong Law 
of Large Numbers (Theorem 4.1). 

                                      Sampling prob. error 
Graph          B (sampling budget)    FS      MRW      SRW 
Internet RLT   100                    17%     257%     156% 
YouTube        20                     43%     236%     216% 
Hep-Th         20                     36%     1510%    781% 

Table 4: Relative worst-case difference between the steady state 
and the transient edge sampling probabilities after $B - K$ steps. 
Frontier edge sampling probabilities are closer to steady state in 
all graphs. Legend: (FS) = Frontier sampling ($K = 10$), (SRW) 
= Single ($K = 1$) Random Walker, and (MRW) = Multiple ($K = 
10$) Random Walkers. 

Proof. Consider the $(n-1)$-st step of Frontier sampling. 
The reader may find Figure 2 helpful in following the proof. 
Let $L_n = (v_1, \dots, v_m)$ be the state of Frontier sampling 
before the $n$-th step. Clearly $L_n \in V^m$. In what follows let 
$e(L_n)$ denote the collection of all edges associated with the 
vertices in $L_n$. We refer to $e(L_n)$ as the edge frontier at 
the $n$-th step. We describe the transition from state $L_n$ to 
state $L_{n+1}$ as follows (lines 4 and 5 of the Frontier sampling 
algorithm): select a vertex $v \in L_n$ with probability 
proportional to $\deg(v)$ and then replace element $v$ in $L_n$ with 
one of its neighbors (selected uniformly at random). This is 
equivalent to randomly sampling an edge from $e(L_n)$ with 
probability 

$$\frac{1}{|e(L_n)|} = \frac{1}{\sum_{v \in L_n} \deg(v)}.$$ 

Thus, $L_n$ is able to transition to state $L_{n+1}$ iff 
$(L_n, L_{n+1}) \in E_m$, and the transition probability from $L_n$ 
to $L_{n+1}$ is $1/|e(L_n)|$. We conclude that Frontier sampling is 
a single random walker over the $m$-th Cartesian power of $G$, 
$G^m = (V^m, E_m)$, where 

$$V^m = \{(v_1, \dots, v_m) \mid v_1 \in V \wedge \dots \wedge v_m \in V\}$$ 

is the $m$-ary Cartesian product of $V$ and, $\forall v, u \in V^m$, 
$(v, u) \in E_m$ if there exists an index $i$ such that 
$(v_i, u_i) \in E$ and $u_j = v_j$ for $j \neq i$. Note that 
$|E_m| = m\,|V|^{m-1}\,|E|$. 
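
The equivalence between the two-stage transition (degree-proportional vertex choice, then a uniform neighbor) and uniform sampling from the edge frontier can be checked on a toy example. The following is a minimal sketch, not the authors' implementation: the adjacency list `ADJ` and the helper names `frontier_edge_probabilities` and `frontier_step` are assumptions introduced only for illustration.

```python
from fractions import Fraction
import random

# Hypothetical toy graph (symmetric adjacency lists): a triangle 0-1-2 plus a pendant vertex 3.
ADJ = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def frontier_edge_probabilities(adj, state):
    """Exact probability of every move (position, v, u) out of `state`."""
    total_degree = sum(len(adj[v]) for v in state)  # |e(L_n)|
    probs = {}
    for i, v in enumerate(state):
        pick_vertex = Fraction(len(adj[v]), total_degree)  # proportional to deg(v)
        pick_neighbor = Fraction(1, len(adj[v]))           # uniform over neighbors of v
        for u in adj[v]:
            probs[(i, v, u)] = pick_vertex * pick_neighbor
    return probs

def frontier_step(adj, state, rng=random):
    """One random transition L_n -> L_{n+1}; returns (new_state, sampled_edge)."""
    weights = [len(adj[v]) for v in state]
    i = rng.choices(range(len(state)), weights=weights)[0]
    u = rng.choice(adj[state[i]])
    new_state = list(state)
    new_state[i] = u
    return tuple(new_state), (state[i], u)

if __name__ == "__main__":
    # Every edge in the frontier of state (2, 3) has the same probability 1/|e(L_n)|.
    for move, p in frontier_edge_probabilities(ADJ, (2, 3)).items():
        print(move, p)
```

The two factors cancel the degree of the chosen vertex, which is why every edge of the frontier ends up with probability $1/|e(L_n)|$.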

Now we need to prove that the distribution of $L_\infty$ is stable 
and unique. For this we only need to show that the random walk 
over $G^m$ is ergodic. A random walk (Markov chain) is ergodic 
when it is aperiodic and recurrent non-null. Recall that the 
random walk over $G$ is ergodic. The probability that Frontier 
sampling transitions from $L_n \in V^m$ to $L_{n+1} \in V^m$ such 
that $L_n$ and $L_{n+1}$ only differ in their $i$-th element is 
always greater than zero; otherwise there is an infinite 
increasing degree sequence in the vertices of $G$. But this is 
not possible, as the random walk over $G$ is recurrent non-null 
(an infinite increasing degree sequence would be a sink in the 
random walk over $G$). Thus, any finite sequence of transitions 
$\{L_{n+w}\}_{w=1}^{A}$, $A > 1$, that only updates its $i$-th 
element has probability greater than zero. As the sequence 
$\{L_{n+w}\}_{w \geq 1}$ is also a single random walk over $G$, it 
is aperiodic for any chosen $i = 1, \dots, m$, and therefore a 
random walker over $G^m$ must also be aperiodic. We can use the 
same argument to show that the random walk over $G^m$ is 
recurrent non-null. As the random walk over $G^m$ is ergodic, 
$L_\infty$ is distributed according to the steady state 
distribution of a random walk over $G^m$. Let 
$L_\infty = \lim_{n \to \infty} L_n$ denote the state (vertex in 
$G^m$) when FS is in steady state. The number of edges out of 
vertex $L_\infty \in V^m$ is $\sum_{i=1}^{m} \deg(v_i)$. The number 
of edges in $G^m$ is $m\,|V|^{m-1}\,\mathrm{vol}(V)$, as each 
vertex $v \in V$ appears $m\,|V|^{m-1}$ times in $G^m$. Thus, 

$$P[L_\infty = (v_1, \dots, v_m)] = \frac{\sum_{i=1}^{m} \deg(v_i)}{m\,|V|^{m-1}\,\mathrm{vol}(V)},$$ 

which is unique and stable (similar to the single random walker 
case discussed earlier). 
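
The normalizing constant in the denominator can be verified by summing the out-degrees over all states of $G^m$. The short derivation below is a worked restatement of the counting argument, assuming (as the notation suggests) that $\mathrm{vol}(V) = \sum_{v \in V} \deg(v)$.

```latex
\begin{align*}
\sum_{(v_1,\dots,v_m) \in V^m} \; \sum_{i=1}^{m} \deg(v_i)
  &= \sum_{i=1}^{m} \Bigl( \sum_{v_i \in V} \deg(v_i) \Bigr)
     \prod_{j \neq i} \Bigl( \sum_{v_j \in V} 1 \Bigr) \\
  &= \sum_{i=1}^{m} \mathrm{vol}(V)\, |V|^{m-1} \\
  &= m\, |V|^{m-1}\, \mathrm{vol}(V).
\end{align*}
```

Dividing the out-degree of a state by this total gives a probability distribution over $V^m$, matching the usual degree-proportional stationary law of a random walk on a symmetric graph.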

The rest of the proof is straightforward. Each edge in 
$G^m$ is actually an edge in $G$. As each edge in $G$ is copied 
$m\,|V|^{m-1}$ times into $G^m$, edges in $G$ are also 
sampled uniformly at random in a random walk over $G^m$. 
As Frontier sampling is a random walk over $G^m$, its samples 
form a stationary sequence and follow the Strong Law of 
Large Numbers seen in Theorem 4.1. The same is true for 
the sequence of sampled vertices. □ 
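
Properties (I) and (II) can be checked exactly on a small graph by enumerating all states of $G^m$. The sketch below uses exact rational arithmetic and a hypothetical toy graph (`ADJ`, with $m = 2$); the function names are assumptions introduced for illustration and are not taken from the paper.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy graph: connected and non-bipartite (triangle 0-1-2 plus pendant vertex 3).
ADJ = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def fs_stationary(adj, m):
    """Claimed steady state: P[L] = sum_i deg(v_i) / (m |V|^(m-1) vol(V))."""
    n = len(adj)
    vol = sum(len(nb) for nb in adj.values())
    norm = m * n ** (m - 1) * vol
    return {L: Fraction(sum(len(adj[v]) for v in L), norm)
            for L in product(adj, repeat=m)}

def fs_moves(adj, L):
    """Yield (next_state, sampled_edge, probability) for every move out of L."""
    total = sum(len(adj[v]) for v in L)
    for i, v in enumerate(L):
        for u in adj[v]:
            yield L[:i] + (u,) + L[i + 1:], (v, u), Fraction(1, total)

def one_step(adj, dist):
    """Push a distribution over states through one FS transition."""
    nxt = {L: Fraction(0) for L in dist}
    edge_prob = {}
    for L, p in dist.items():
        for L2, edge, q in fs_moves(adj, L):
            nxt[L2] += p * q
            edge_prob[edge] = edge_prob.get(edge, Fraction(0)) + p * q
    return nxt, edge_prob

if __name__ == "__main__":
    pi = fs_stationary(ADJ, m=2)
    nxt, edge_prob = one_step(ADJ, pi)
    print("stationary:", nxt == pi)
    print("edge probabilities:", sorted(set(edge_prob.values())))
```

Because the check is exhaustive, it is only practical for very small $|V|$ and $m$: the state space has $|V|^m$ elements.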



B. CONVERGENCE TO UNIFORM EDGE 
SAMPLING 

A question we seek to answer with our simulations is how 
fast these random walk methods converge to their stationary 
edge sampling probabilities. In this simulation we set 
$K \in \{1, 10\}$ (number of independent random walkers), $m = 10$ 
(Frontier sampling dimension) and restrict our analysis to 
the largest connected component of the three graphs in our 
datasets with the smallest number of vertices (in order to 
speed up the computation): "Internet RLT", "YouTube", and 
"Hep-th". Let $p^{(B)}_{u,v}$ denote the probability that a random 
walker, whose initial vertex is chosen uniformly at random, 
samples edge $(u,v)$ at the end of its sampling budget $B$. 
To measure the convergence to the stationary edge sampling 
probability, we use the largest relative difference between the 
stationary sampling probability $1/|E|$ and $p^{(B)}_{u,v}$: 

$$\max_{(u,v) \in E} \left| 1 - \frac{p^{(B)}_{u,v}}{1/|E|} \right|.$$ 

Table 4 presents a Monte Carlo estimate of this relative 
difference. The 95% confidence interval of the Monte Carlo 
simulation is ±1%. Our estimates show that the difference 
between the transient and the stationary edge sampling 
probabilities of independent random walkers is between 5 
and 42 times larger than that of Frontier sampling. 
This means that Frontier sampling converges faster to the 
stationary edge sampling probability. 
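
As an illustration of the convergence measure, the sketch below computes the largest relative difference exactly (up to floating point) for a single random walker on a toy graph, by propagating the vertex distribution instead of running Monte Carlo simulations. It rests on several assumptions that are not from the paper: the graph `ADJ`, the function name `transient_edge_error`, the reading of the measure as $\max_{(u,v)} |1 - |E|\,p_{u,v}|$, and the convention that `steps` counts the move at which the edge is sampled.

```python
# Hypothetical toy graph: connected and non-bipartite (triangle 0-1-2 plus pendant vertex 3).
ADJ = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def transient_edge_error(adj, steps):
    """Largest relative gap between 1/|E| and the probability that a single walker,
    started at a uniformly random vertex, traverses edge (u, v) at move `steps`."""
    n = len(adj)
    num_edges = sum(len(nb) for nb in adj.values())  # directed (symmetric) edge count
    x = {v: 1.0 / n for v in adj}                    # vertex distribution before move 1
    for _ in range(steps - 1):
        nxt = {v: 0.0 for v in adj}
        for v, pv in x.items():
            for u in adj[v]:
                nxt[u] += pv / len(adj[v])           # move to a uniform neighbor
        x = nxt
    # An edge (v, u) is sampled with probability x[v] / deg(v).
    return max(abs(1.0 - (x[v] / len(adj[v])) * num_edges)
               for v in adj for u in adj[v])

if __name__ == "__main__":
    for steps in (1, 2, 5, 20):
        print(steps, transient_edge_error(ADJ, steps))
```

On this toy graph the error shrinks geometrically with the number of steps; the values in Table 4 come from much larger graphs and short budgets, where the walkers are still far from steady state.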



